Diabetic Medical Data Classification using Machine Learning Algorithms
Naresh. K, Prabakaran. N, Kannadasan. R, Boominathan. P*
School of Computer Science and Engineering, VIT University, Vellore-632014, India
*Corresponding Author E-mail: boomi051281@gmail.com
ABSTRACT:
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. In this paper we apply different classification algorithms to diabetic data sets and compare their accuracy. A person with diabetes is at risk of other diseases such as blood vessel damage, blindness, heart disease, nerve damage and kidney disease. Diabetes is classified into two types, insulin-dependent and non-insulin-dependent. It is a disease in which blood glucose increases because of defects in the secretion of insulin, in its action, or in both. Diabetes is a prolonged medical disease. In diabetes, the cells of a person produce an insufficient amount of insulin or defective insulin, or are unable to use insulin properly and efficiently, which leads to hyperglycemia and type-2 diabetes. We propose an efficient two-level approach for classifying data. In the initial phase we use the training data to analyze the optimality of the dataset and form a new, optimal training dataset; we then apply our classification mechanism to new diabetic datasets. Data mining methods and techniques are explored to identify those suitable for efficient classification of the diabetic data set and for mining useful patterns from it.
KEYWORDS: Data mining, diabetic dataset, Classification, Naive Bayes classification, Random forest
INTRODUCTION:
This study investigates the performance and accuracy of different classification methods using Weka. A common problem in bioinformatics and medical science is reaching the correct diagnosis from important medical information. For a final diagnosis, many tests are generally done, involving clustering and classification of the data. All of these testing procedures are necessary to reach the final diagnosis. On the other hand, too many tests complicate the process and make it difficult to obtain the end results. These difficulties can be resolved by using various classification methods. Diabetes mellitus, or simply diabetes, is a set of related diseases in which the body cannot regulate the amount of sugar in the blood.
Diabetes is common in every age group. It is costly, and it is growing quickly. It is also known as a metabolic or hereditary disease in which a person has a high blood sugar level, either because the body does not produce enough insulin or because the body's cells do not respond to the insulin produced in the pancreas. Its classic symptoms are polyuria, polydipsia and polyphagia. The International Diabetes Federation has claimed that 246 million people currently suffer from diabetes worldwide, and this number is expected to increase to 380 million by 2025. A substantial amount of research has been done on medical data with various algorithms such as Bayes, J48graft, C4.5 and the conjunctive rule learner.
RELATED WORK:
A good number of data mining techniques have been applied to medical diagnosis. Rahman et al. [1] compared different classification algorithms across different tools, namely Weka, Tanagra and MATLAB, to find the accuracy on a single diabetes data set. In Weka they report 79.19% using ML, 78.98% using NB and 81.33% using J48; in Tanagra they report 83.85%, 100% and 90.63% using the same ML, NB and J48. Kamath et al. [2] explore mining algorithms on a diabetic patients database. Classification techniques are applied to the data of the diabetes patients, evaluated with 10-fold cross-validation, and the results are compared on that basis. The techniques include Naïve Bayes, K-Star, OneR and SimpleCart, which give accuracies of 77.80%, 71.17%, 75.76% and 76.02%.
Vincent et al. [3] evaluate performance measures for comparing classifiers. The measures include the ground truth index (GTI) for obtaining classification accuracy, and the comparison is done from a theoretical point of view. Classification accuracy is assessed with different measures such as Jaccard's coefficient. Mukesh et al. [4] predict diabetes using Bayesian classification. The data set is collected from a hospital and classification is used for analysis. They use the J48 tree and BayesNet classifiers and obtain accuracies of 76.75% and 75.30% using a single data mining tool, WEKA. Keerthana et al. [5] measure performance using both classification and clustering algorithms. The work includes classifiers such as BayesNet, Naïve Bayes and OneR together with a clustering algorithm to improve the accuracy on the dataset. Nadav et al. [6] use confusion matrix values to improve classification. They demonstrate the benefits of the method by applying it to error-correcting codes using AdaBoost with an orthogonal array code matrix.
RESEARCH FRAMEWORK:
The methodology adopted for this research problem begins with data collection. The training data set used for data mining is the Pima Indian diabetes database from the UCI machine learning repository. The dataset is converted into the required form, and a comparative study of algorithms is carried out to select the most efficient one. The objective of classification is to assign a class to previously unseen records as accurately as possible. The aim is to find a classification model based on the class attribute and to measure the accuracy of the model. The data set is divided by a split percentage into a training set and a test set. The training set is used to build the model and the test set is used to validate its accuracy on new data.
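The division into a training set and a test set can be illustrated with a minimal sketch. It is not the Weka procedure itself: the function name `percentage_split`, the 50% ratio and the fixed seed are assumptions made for illustration, and the records are placeholders rather than the actual PIMA data.

```python
import random

def percentage_split(records, train_fraction=0.5, seed=1):
    """Shuffle the records and split them into a training set and a test set."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for the 768 PIMA instances.
records = list(range(768))
train, test = percentage_split(records, train_fraction=0.5)
print(len(train), len(test))
```

Shuffling before cutting keeps the class proportions in the two parts roughly similar; a stratified split would enforce this exactly but is not described in the text.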
Figure 1: Architecture diagram of proposed work
NAÏVE BAYES:
Bayesian classification is both a supervised learning method and a statistical method for classification. It assumes an underlying probabilistic model, which allows uncertainty about the model to be captured in a principled way by determining the probabilities of the outcomes. It can solve diagnostic and predictive problems. The classification is named after Thomas Bayes (1702-1761), who proposed Bayes' theorem, on which it is based. Bayesian classification provides a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in the input data. The simple Bayesian classifier, known as the naïve Bayesian classifier, has been found to be comparable in performance to decision tree and neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large datasets. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X) and P(X|H):
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \tag{1}$$
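where P(H|X) is the posterior probability of H conditioned on X, P(X|H) is the posterior probability of X conditioned on H, P(H) is the prior probability of H, which is independent of X, and P(X) is the prior probability of X.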
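As a worked illustration of Bayes' theorem under the naïve assumption, the sketch below multiplies per-attribute likelihoods for each class, weights them by the class prior, and normalizes by P(X). All numbers are hypothetical and chosen only for illustration; the function name `naive_bayes_posterior` is likewise an assumption, not part of Weka.

```python
import math

def naive_bayes_posterior(priors, attribute_likelihoods):
    """Return P(H|X) for every hypothesis H using Bayes' theorem.

    priors[h] is P(H); attribute_likelihoods[h] lists P(x_i|H) for each
    attribute value x_i, assumed conditionally independent (the naive step).
    """
    # Numerator of Bayes' theorem: P(X|H) * P(H), with P(X|H) as a product.
    scores = {h: priors[h] * math.prod(attribute_likelihoods[h]) for h in priors}
    # Denominator: P(X), obtained by summing the numerators over all classes.
    evidence = sum(scores.values())
    return {h: score / evidence for h, score in scores.items()}

# Hypothetical priors and likelihoods for two attribute values of one patient.
priors = {"tested_positive": 0.35, "tested_negative": 0.65}
attribute_likelihoods = {
    "tested_positive": [0.6, 0.5],
    "tested_negative": [0.2, 0.4],
}
post = naive_bayes_posterior(priors, attribute_likelihoods)
predicted = max(post, key=post.get)  # class with the highest posterior
print(post, predicted)
```

The classifier predicts the class with the largest posterior; because P(X) is the same for every class, it affects the reported probabilities but not the predicted label.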
RANDOM FOREST:
Random Forest is a supervised machine learning algorithm with the potential to become a popular technique for future classifiers. It is regarded as one of the best classification algorithms for classifying large sets of data. It is a combination of tree predictors in which each tree depends on the values of a random vector sampled independently, with the same distribution for all trees in the forest. Introducing the right kind of randomness makes them accurate classifiers. The algorithm was developed by Leo Breiman and Adele Cutler. Random forest grows many classification trees, each as follows:
1. Let N be the number of cases in the training set. N1 cases are sampled at random, with replacement, from the original data (N-N1). This sample is the training set for growing the tree.
2. Let M be the number of input variables. A number m is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible.
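The two sources of randomness in the steps above, the bootstrap sample per tree and the random subset of variables per node, can be sketched as follows. This is only an illustration of the sampling, not a tree-growing implementation: the function names, the seed, the sample size and m = 3 are assumptions, while M = 7 follows the number of inputs reported for the data set.

```python
import random

def bootstrap_sample(cases, sample_size, rng):
    """Step 1: draw sample_size cases at random, with replacement."""
    return [rng.choice(cases) for _ in range(sample_size)]

def candidate_variables(num_variables, m, rng):
    """Step 2: choose m of the M input variables (without repeats) for one node."""
    return rng.sample(range(num_variables), m)

rng = random.Random(42)
cases = list(range(768))                      # placeholder case indices
sample = bootstrap_sample(cases, len(cases), rng)
features = candidate_variables(7, 3, rng)     # M = 7 inputs, m = 3 (assumed)

# Cases never drawn into this sample are not seen by this tree.
out_of_bag = set(cases) - set(sample)
print(len(set(sample)), len(out_of_bag), features)
```

Because sampling is done with replacement, some cases appear several times in a tree's sample and others not at all, which is what makes the trees differ from one another.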
DATA SET DESCRIPTION:
The characteristics of the data set used in this research are given below; a detailed description of the data set is available in the UCI repository [8].
Our objective is to diagnose diabetes based on personal details of the patients, such as age, blood pressure, body mass index (BMI), insulin and plasma glucose, in order to decide whether a record in the Pima Indian Diabetes Data (PIMA) belongs to the class tested positive or tested negative.
The data set is publicly available from the UCI machine learning repository. The problem posed here is to predict whether a person would test positive or negative. This is a two-class problem in which 1 is interpreted as tested positive and 2 as tested negative. Of the 768 instances, 268 belong to class 1 and 500 belong to class 2.
Table 1: Dataset Description
| Data set | No. of examples | Inputs | Classes | Total Attributes | Noisy |
|---|---|---|---|---|---|
| PIMA | 768 | 7 | 2 | 8 | No |
The purpose of the study is to investigate the relationship between diabetes diagnostic results and a list of variables representing measurements and medical attributes. The data set comprises 768 tuples with 7 input attributes, 2 classes and 8 attributes in total. The attributes used for classification are given below:
Plasma glucose concentration
Blood pressure (mm Hg)
Triceps skin fold thickness (mm)
Insulin (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age (years)
Class variable (positive 1 or negative 2)
EXPERIMENTAL SETUP IN WEKA:
The aim of data mining is to make sense of large amounts of mostly unsupervised data in some domain. Classification maps data into predefined groups. It is often referred to as supervised learning because the classes are determined before the data are examined. In the analysis of the diabetes dataset, two classes, tested positive and tested negative, are defined based on the data attribute values.
PERFORMANCE METRICS:
We measure the performance of the classifiers with respect to different metrics such as accuracy, precision, recall, F-measure, ROC curve and gamma statistics, along with the confusion matrix. True positives (TP) are the positive tuples that were correctly labeled by the classifier. The TP rate is the ratio of the diagonal element of the confusion matrix to the sum of the relevant row.
Similarly, false positives (FP) are the negative tuples that were incorrectly labeled by the classifier. The FP rate is the ratio of the non-diagonal element of the confusion matrix to the sum of the relevant row. True negatives (TN) are the negative tuples that were correctly labeled by the classifier; TN is calculated as the ratio of the diagonal element of the confusion matrix to the sum of the relevant column. False negatives (FN) are the positive tuples that were incorrectly labeled by the classifier; FN is calculated as the ratio of the non-diagonal element of the confusion matrix to the sum of the relevant column.
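The row-based TP rate and FP rate can be read off a two-class confusion matrix as in the sketch below. It assumes the layout used in Weka's output, with rows as actual classes and columns as predicted classes, and uses the Naïve Bayes training matrix reported in this paper; the function name `tp_fp_rates` is hypothetical.

```python
def tp_fp_rates(matrix):
    """Return (TP rate, FP rate) for the positive class of a 2x2 matrix.

    matrix[i][j] counts tuples of actual class i classified as class j,
    with index 0 = tested_positive and index 1 = tested_negative.
    """
    tp, fn = matrix[0]   # actual positives: correctly / incorrectly labeled
    fp, tn = matrix[1]   # actual negatives: incorrectly / correctly labeled
    tp_rate = tp / (tp + fn)   # diagonal element over its row sum
    fp_rate = fp / (fp + tn)   # non-diagonal element over its row sum
    return tp_rate, fp_rate

naive_bayes_train = [[156, 112],
                     [70, 430]]
tp_rate, fp_rate = tp_fp_rates(naive_bayes_train)
print(f"TP rate = {tp_rate:.3f}, FP rate = {fp_rate:.3f}")
```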
PRECISION:
Precision in Weka is defined as the fraction of retrieved elements that are relevant. The precision of class A is the ratio of its diagonal element to the sum of the relevant column, and the precision of class B is obtained in the same way from its own diagonal element and column.
Recall is defined as the fraction of the elements relevant to the query that are successfully retrieved.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
Accuracy is determined by using the formula
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
A measure that combines precision and recall is their harmonic mean, the F-measure, defined by:
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
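A short sketch shows how precision, recall and the F-measure follow from the counts of a confusion matrix. It uses the tested_positive class of the Naïve Bayes test matrix reported in this paper (TP = 86, FN = 59, FP = 37); the function name `precision_recall_f` is an assumption, and the F-measure is computed as the harmonic mean of precision and recall.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp)   # correct positives among predicted positives
    recall = tp / (tp + fn)      # correct positives among actual positives
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts for tested_positive in the Naive Bayes test matrix.
p, r, f = precision_recall_f(tp=86, fp=37, fn=59)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

The harmonic mean is pulled toward the smaller of the two values, so a classifier cannot score a high F-measure by doing well on precision alone or on recall alone.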
Evaluating the confusion matrix at each iteration enables a decision about the next one-against-all classifier that should be added to the current code; the benefits of the method are demonstrated by applying it to error-correcting codes with orthogonal arrays as the basic code matrix.
Table 1: Decision Making
| Positive | Negative |
|---|---|
| True positive | False positive |
| False negative | True negative |
For a two-class matrix, the true positive, false positive, false negative and true negative counts are arranged in the order given in Table 1.
COMPARISON OF NAÏVE BAYES AND RANDOM FOREST ALGORITHM IN WEKA:
Our comparison is based on the time taken, accuracy and confusion matrix.
Table 2: Time taken for Train and Test
| Time taken | Naïve Bayes | Random Forest |
|---|---|---|
| Train | 0.02 seconds | 0.14 seconds |
| Test | 0.02 seconds | 0.11 seconds |
Table 3: Classified Instances
| Method | Correctly Classified Instances |
|---|---|
| Naïve Bayes Train | 76% |
| Naïve Bayes Test | 75% |
| Random Forest Train | 97% |
| Random Forest Test | 97% |

Confusion matrices (rows: actual class; columns: classified as)
| Confusion Matrix | Actual class | Classified as a | Classified as b |
|---|---|---|---|
| Naïve Bayes Train | a = tested_positive | 156 | 112 |
| Naïve Bayes Train | b = tested_negative | 70 | 430 |
| Naïve Bayes Test | a = tested_positive | 86 | 59 |
| Naïve Bayes Test | b = tested_negative | 37 | 202 |
| Random Forest Train | a = tested_positive | 266 | 2 |
| Random Forest Train | b = tested_negative | 15 | 485 |
| Random Forest Test | a = tested_positive | 143 | 2 |
| Random Forest Test | b = tested_negative | 9 | 230 |
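The share of correctly classified instances can be recomputed from the four confusion matrices above as the sum of the diagonal divided by the total count. The sketch below is a minimal check of that arithmetic; the dictionary keys and the function name `accuracy` are chosen here for illustration.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Confusion matrices as reported (rows: actual class, columns: classified as).
matrices = {
    "Naive Bayes Train":   [[156, 112], [70, 430]],
    "Naive Bayes Test":    [[86, 59], [37, 202]],
    "Random Forest Train": [[266, 2], [15, 485]],
    "Random Forest Test":  [[143, 2], [9, 230]],
}

for name, matrix in matrices.items():
    print(f"{name}: {accuracy(matrix):.1%}")
```

The resulting percentages can be compared against the values listed for correctly classified instances in Table 3.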
Figure 2: Graphical representation of Time (seconds) for Train and Test Data set
Figure 3: Graphical representation of Classified accuracy
CONCLUSION:
In this research, data mining techniques were applied to classify diabetes data and predict whether a patient is likely to be affected by diabetes. Different classification algorithms were applied to the single Pima Indian Diabetes Dataset, and the results obtained are tabulated above. The research can be extended by applying association mining and by applying the approach to different medical datasets.
REFERENCES:
1. Rahman, R. M. and Afroz, F. Comparison of various classification techniques using different data mining tools for diabetes diagnosis. Journal of Software Engineering and Applications, 2013; 6(03): 85-97.
2. Kamath, R. S. Weka Approach for Exploration Mining in Diabetic Patients Database. Chatrapati Shahu Institute of Business Education and Research, Kolhapur, India. 2013.
3. Labatut, V. and Cherifi, H. Evaluation of performance measures for classifiers comparison. Ubiquitous Computing and Communication Journal, 2011; 6: 21-34.
4. Kumari, M., Vohra, R., and Arora, A. Prediction of Diabetes Using Bayesian Network. International Journal of Computer Science and Information Technologies, 2014; 5(4): 5174-5178.
5. Keerthana, G., and Srividhya, V. Performance Enhancement of Classifiers using Integration of Clustering and Classification Techniques. International Journal of Computer Science Engineering, 2014; 3(3): 200-203.
6. Marom, N. D., Rokach, L., and Shmilovici, A. Using the confusion matrix for improving ensemble classifiers. In 26th Convention of Electrical and Electronics Engineers in Israel (IEEEI), 2010:555-559.
Received on 30.06.2017 Modified on 29.07.2017
Accepted on 11.08.2017 © RJPT All right reserved
Research J. Pharm. and Tech. 2018; 11(1): 97-100
DOI: 10.5958/0974-360X.2018.00018.5